Project-Team:BAMBOO

Inria | Raweb 2014 | Presentation of the Project-Team BAMBOO | BAMBOO Web Site



Section: New Results

Efficient Algorithms for analysing RNA-seq Data

In recent years, we addressed the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we proposed a general model for all variants in such graphs. We then introduced an exact algorithm, called KisSplice, to extract alternative splicing events, and showed that it identifies more correct events than general-purpose transcriptome assemblers.
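To make the bubble pattern concrete, here is a minimal Python sketch of a de Bruijn graph built from two toy sequences that differ by one skipped segment. It is not the KisSplice implementation: the sequences, the choice of $(k-1)$-mers as vertices, and the function names `de_bruijn`, `branching_nodes` and `merging_nodes` are assumptions made for illustration only.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Vertices with more than one successor (where a bubble may open)."""
    return sorted(v for v, succ in graph.items() if len(succ) > 1)

def merging_nodes(graph):
    """Vertices with more than one predecessor (where a bubble may close)."""
    indegree = defaultdict(int)
    for succ in graph.values():
        for w in succ:
            indegree[w] += 1
    return sorted(v for v, d in indegree.items() if d > 1)

# Hypothetical toy data: the long form contains the segment "ACTT",
# the short form skips it.
long_form = "ATCGG" + "ACTT" + "TAGCA"
short_form = "ATCGG" + "TAGCA"

graph = de_bruijn([long_form, short_form], k=4)
print(branching_nodes(graph))  # ['CGG']: the two forms diverge here
print(merging_nodes(graph))    # ['TAG']: the two forms rejoin here
```

The two paths between `CGG` and `TAG` share only their endpoints, which is the bubble shape that the skipped segment leaves in the graph.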

The main time bottleneck in the KisSplice algorithm is the bubble enumeration step. Thus, in an effort to make our method as scalable as possible, we previously modified Johnson's cycle listing algorithm (Johnson (1975)) to enumerate bubbles in general directed graphs while maintaining the same time complexity. Using a different enumeration technique, we have now proposed an algorithm to list bubbles with path length constraints in weighted directed graphs [29]. For a graph with $n$ vertices and $m$ edges, this method lists all bubbles with a given source with $O(n(m + n \log n))$ delay. Moreover, we showed experimentally that this algorithm is several orders of magnitude faster than the listing algorithm of KisSplice at identifying bubbles corresponding to alternative splicing events.
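As an illustration of what is being listed, the following Python sketch enumerates bubbles from a given source by brute force: it collects all simple paths of bounded weight and pairs those that end at the same vertex and share no internal vertex. This is deliberately naive (exponential in the worst case) and is not the algorithm of [29]; the single bound `max_len` applied to both paths, the adjacency-list format, and the function names are assumptions.

```python
from itertools import combinations

def simple_paths_from(graph, s, max_len):
    """Yield every simple path starting at s with total weight <= max_len.

    graph maps a vertex to a list of (successor, weight) pairs.
    """
    stack = [([s], 0)]
    while stack:
        path, length = stack.pop()
        if len(path) > 1:
            yield path
        for v, w in graph.get(path[-1], []):
            if v not in path and length + w <= max_len:
                stack.append((path + [v], length + w))

def bubbles_from(graph, s, max_len):
    """Return pairs of s-t paths that share only their endpoints s and t."""
    by_target = {}
    for path in simple_paths_from(graph, s, max_len):
        by_target.setdefault(path[-1], []).append(path)
    bubbles = []
    for paths in by_target.values():
        for p, q in combinations(paths, 2):
            # Internal vertices exclude the shared source and target.
            if set(p[1:-1]).isdisjoint(q[1:-1]):
                bubbles.append((p, q))
    return bubbles

# Hypothetical toy graph: two parallel routes from s to t, then a tail to u.
toy = {"s": [("a", 1), ("b", 1)], "a": [("t", 1)], "b": [("t", 1)], "t": [("u", 1)]}
for p, q in bubbles_from(toy, "s", max_len=2):
    print(p, q)
```

The brute-force version makes the cost of the enumeration step visible: it explores every bounded simple path, including those that never close a bubble, which is the kind of wasted work a delay-bounded listing algorithm avoids.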

Additionally, we showed that the same techniques used to list bubbles can be applied to a classical enumeration problem, the $K$-shortest paths problem [29]. We considered a different parameterisation of this problem: instead of bounding the number of $st$-paths, we bound their weight. We presented a general scheme to list bounded-length $st$-paths in weighted graphs that takes $O(n \, t(n, m))$ time per path, where $t(n, m)$ is the time for a single-source shortest path computation. This algorithm uses memory linear in the size of the graph, independent of the number of paths output. For undirected non-negatively weighted graphs, we also showed an improved algorithm that lists all $st$-paths with bounded length in $O(m + t(n, m))$ time per path.
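One way to obtain a per-path cost of this form is to extend a partial path only when a shortest-path computation certifies that the target is still reachable within the remaining budget, so that every branch explored leads to at least one output. The Python sketch below follows that idea under stated assumptions: non-negative weights (so Dijkstra can play the role of the $t(n, m)$ computation), an adjacency-list input, and hypothetical function names. It is a sketch in the spirit of the scheme, not the exact algorithm of [29].

```python
import heapq

def reverse_distances(rev, t, blocked):
    """Dijkstra from t on the reversed graph, skipping vertices in `blocked`.

    Returns the shortest distance to t for each vertex that can still reach it.
    """
    dist = {t: 0}
    heap = [(0, t)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in rev.get(x, []):
            if y in blocked:
                continue
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist

def bounded_st_paths(graph, s, t, bound):
    """Yield (path, weight) for every simple s-t path of weight <= bound."""
    rev = {}
    for u, edges in graph.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))

    def extend(path, used, length):
        u = path[-1]
        if u == t:
            yield list(path), length
            return
        # One shortest-path computation per recursion node, avoiding the
        # vertices already on the partial path.
        dist = reverse_distances(rev, t, used)
        for v, w in graph.get(u, []):
            if v not in used and length + w + dist.get(v, float("inf")) <= bound:
                path.append(v)
                used.add(v)
                yield from extend(path, used, length + w)
                path.pop()
                used.remove(v)

    if reverse_distances(rev, t, set()).get(s, float("inf")) <= bound:
        yield from extend([s], {s}, 0)

# Hypothetical toy graph with three s-t paths of weights 2, 4 and 4.
toy = {"s": [("a", 1), ("b", 2)], "a": [("t", 1), ("b", 1)], "b": [("t", 2)]}
for path, weight in bounded_st_paths(toy, "s", "t", bound=3):
    print(path, weight)  # ['s', 'a', 't'] 2
```

Because the search is depth-first and only the current path and one distance table per recursion level are kept, the memory used does not grow with the number of paths output.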

The main memory bottleneck in KisSplice is the construction and representation of the de Bruijn graph. Thus, again with the goal of making our method as scalable as possible, we proposed a new compact way to build and represent a de Bruijn graph that improves over the state of the art [22]. We showed both theoretically and experimentally that our approach uses 30% to 40% less memory than the previous state of the art, with an insignificant impact on the construction time. Our de Bruijn graph representation is general: it is not restricted to the variation-finding or RNA-seq context, and can be used as part of any algorithm that represents NGS data with de Bruijn graphs.

A major issue when analysing transcriptomes using short sequencing reads is dealing with repeats that are longer than the reads. We proposed a first explicit model for large families of inexact repeats in the de Bruijn graphs generated from RNA-seq data [21]. Taking advantage of this model, we also proposed an efficient algorithm that enumerates alternative splicing events without traversing repeat-induced subgraphs, thereby offering a first answer to one of the main questions left open at the end of Gustavo Sacomoto's PhD [4].

Motivated by previous work on the classical problem of listing cycles, we also studied, from a more purely theoretical point of view, how to list chordless cycles [28]. We thus developed an amortized $\tilde{O}(|V|)$-delay algorithm for listing chordless cycles in undirected graphs. Chordless cycles are very natural structures in undirected graphs, with an important history and a distinguished role in graph theory. The best known solution to list all the $C$ chordless cycles contained in an undirected graph $G = (V, E)$ takes $O(|E|^2 + |E| \cdot C)$ time. We provide an algorithm taking $\tilde{O}(|E| + |V| \cdot C)$ time. We also show how to obtain the same complexity for listing all the $P$ chordless $st$-paths in $G$ (where $C$ is replaced by $P$).
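For readers unfamiliar with the term, a cycle is chordless when no edge of the graph joins two of its vertices that are not consecutive on the cycle. The short Python check below illustrates the definition only; it is not the listing algorithm of [28], and the function name and the adjacency-set input format are assumptions.

```python
def is_chordless_cycle(adj, cycle):
    """Return True if `cycle` (a list of distinct vertices, in order) is a
    chordless cycle of the undirected graph `adj` (vertex -> set of neighbours).
    """
    n = len(cycle)
    if n < 3 or len(set(cycle)) != n:
        return False
    on_cycle = set(cycle)
    for i, v in enumerate(cycle):
        # The only neighbours of v on the cycle must be its two cycle neighbours:
        # a missing one means `cycle` is not a cycle, an extra one is a chord.
        expected = {cycle[i - 1], cycle[(i + 1) % n]}
        if adj[v] & on_cycle != expected:
            return False
    return True

# Hypothetical example: a square, then the same square with a diagonal added.
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_chordless_cycle(square, [1, 2, 3, 4]))      # True
print(is_chordless_cycle(with_chord, [1, 2, 3, 4]))  # False: edge 1-3 is a chord
```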
